A New Recursive Least-Squares Method
with Multiple Forgetting Schemes


Francesco Fraccaroli, Andrea Peruffo and Mattia Zorzi 


Abstract: We propose a recursive least-squares method with
multiple forgetting schemes to track time-varying model
parameters which change with different rates. Our approach
hinges on the reformulation of the classic recursive least-squares
with forgetting scheme as a regularized least-squares problem.
A simulation study shows the effectiveness of the proposed
method.


I. Introduction 

Recursive identification methods are essential in system
identification, [13], [25], [7], [9], [20], [11]. In particular,
they are able to track variations of the model parameters
over time. This task is fundamental in adaptive control,
[1], [12], [23].

Recursive least-squares (RLS) methods with forgetting
scheme represent a natural way to cope with recursive
identification. These approaches can be understood as a weighted
least-squares problem wherein the old measurements are
exponentially discounted through a parameter called forgetting
factor. Moreover, in [3] their tracking capability has been
analysed in a rigorous way.

In this paper, we deal with models having time-varying
parameters which change with different rates. Many
applications can be placed in this framework. An example is the
automation of heavy duty vehicles, [21]. In this problem, it
is required to estimate the vehicle mass and the road grade.
The former is almost constant over time, whereas the
latter is time-varying. Other examples are the control of strip
temperature for heating furnaces, [24], and self-tuning
cruise control, [14].

In those applications the RLS with forgetting scheme
performs poorly. A refinement of this method
is the RLS with directional forgetting scheme, [6], [8], [2],
[4]. Roughly speaking, this approach addresses the problem that
the incoming information is not uniformly distributed over
all parameters. However, this nonuniformity is not equivalent
to the presence of parameters with different changing rates,
[21]. Indeed, it is possible to construct models with
parameters having different changing rates and with incoming
information uniformly distributed over all parameters. Thus,
the RLS with directional forgetting scheme also performs
poorly in this setting.

An ad-hoc remedy to estimate parameters with different
changing rates is the RLS with vector-type forgetting (or
selective forgetting) scheme, [19], [18], [15], [16]. The idea
of the above method is to introduce several forgetting factors
reflecting the different rates of change of the parameters.
Finally, an ad-hoc modification of the above method has been
presented in [21].

In this paper, we propose a new RLS with multiple
forgetting schemes. Our method is based on the reformulation
of the classic RLS with forgetting scheme as a regularized
least-squares problem. It turns out that the current parameter
vector minimizes the current prediction error plus a penalty
term. The latter is the weighted distance between the current
and the previous value of the parameter vector. Moreover,
the weight matrix is updated at each time step and the
updating law depends on the forgetting factor. This simple
observation leads us to generalize this updating to multiple
forgetting factors reflecting the different changing rates of
the parameters. Moreover, we provide three updating laws
drawing inspiration from machine learning. For simplicity
we consider SISO models because the extension to
MIMO ones is straightforward. Finally, simulations show the
effectiveness of our method.

The remainder of the paper is organized as
follows. In Section II we present the state of the art about
RLS with forgetting scheme and with vector-type forgetting
scheme. The reformulation of the RLS and the three different
updating laws are explained in Section III. The performance
comparisons between these methods are illustrated in Section
IV. Conclusions are drawn in Section V.


II. State of the art 

Consider a SISO linear, discrete-time, time-varying system

$$A_t(z^{-1})\,y(t) = B_t(z^{-1})\,u(t) + e(t), \tag{1}$$

where $e(t)$ is additive noise with variance $\sigma^2$ and $u(t)$ is a
stationary Gaussian process independent of $e(t)$.
$A_t(z^{-1})$ and $B_t(z^{-1})$ are time-varying polynomials whose
degrees are $n$ and $m$, respectively:

$$A_t(z^{-1}) = 1 + \sum_{i=1}^{n} a_{t,i}\, z^{-i}, \tag{2a}$$

$$B_t(z^{-1}) = \sum_{i=1}^{m} b_{t,i}\, z^{-i}, \tag{2b}$$

where $z$ is the shift operator.
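Assume that we collect the data

$$Z^N := \{y(1), u(1), \ldots, y(N), u(N)\}. \tag{3}$$

We want to estimate $A_t(z^{-1})$ and $B_t(z^{-1})$ at each time step
$t$ given $Z^t$. We define

$$\theta_t = [\,-a_{t,1} \;\ldots\; -a_{t,n} \;\; b_{t,1} \;\ldots\; b_{t,m}\,]^T \tag{4}$$

as the vector containing the parameters of $A_t(z^{-1})$ and
$B_t(z^{-1})$. Let $\Phi_t$ denote the regression matrix

$$\Phi_t = \begin{bmatrix} \varphi(t)^T \\ \vdots \\ \varphi(\max(n,m)+1)^T \end{bmatrix} \tag{5}$$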


where $\varphi(t) = [\,y(t-1) \ldots y(t-n) \;\; u(t-1) \ldots u(t-m)\,]^T$.
Let $y_t$ be the vector of observations

$$y_t = [\,y(t) \;\ldots\; y(\max(n,m)+1)\,]^T \tag{6}$$

and, in a similar way, let $e_t$ be the noise vector

$$e_t = [\,e(t) \;\ldots\; e(\max(n,m)+1)\,]^T. \tag{7}$$
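As an illustration of the definitions above, the following minimal Python sketch builds the regressor $\varphi(t)$ and simulates model (1) written in regression form, $y(t) = \varphi(t)^T\theta_t + e(t)$. The function names `regressor` and `simulate_arx`, the white Gaussian input, and the numerical values are assumptions made only for this example.

```python
import random

def regressor(y, u, t, n, m):
    """phi(t) = [y(t-1) ... y(t-n) u(t-1) ... u(t-m)]^T for 1-based time t.

    y[k] and u[k] hold y(k) and u(k); index 0 is unused padding.
    """
    past_outputs = [y[t - i] for i in range(1, n + 1)]
    past_inputs = [u[t - i] for i in range(1, m + 1)]
    return past_outputs + past_inputs

def simulate_arx(theta_of_t, n, m, N, sigma=0.1, seed=0):
    """Simulate y(t) = phi(t)^T theta_t + e(t) for t = max(n, m) + 1, ..., N.

    theta_of_t(t) returns [-a_{t,1}, ..., -a_{t,n}, b_{t,1}, ..., b_{t,m}] as in (4).
    The input is white Gaussian noise here (an assumption of this sketch).
    """
    rng = random.Random(seed)
    start = max(n, m) + 1
    y = [0.0] * (N + 1)
    u = [0.0] + [rng.gauss(0.0, 1.0) for _ in range(N)]
    for t in range(start, N + 1):
        phi = regressor(y, u, t, n, m)
        theta = theta_of_t(t)
        y[t] = sum(p * th for p, th in zip(phi, theta)) + rng.gauss(0.0, sigma)
    return y, u

# Example: first-order model with constant parameters a_{t,1} = -0.5, b_{t,1} = 1.
y, u = simulate_arx(lambda t: [0.5, 1.0], n=1, m=1, N=50)
```

The sign convention in `theta_of_t` follows (4): the entries associated with $A_t(z^{-1})$ are the negated coefficients $-a_{t,i}$.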


A common way to solve such a problem relies on the RLS
with forgetting scheme, [13], [25], where $\hat\theta_t$ is given by

$$\hat\theta_t = \operatorname*{argmin}_{\theta_t} V(\theta_t, t), \tag{8}$$

and the loss function is

$$V(\theta, t) = \sum_{s=1}^{t} \lambda^{t-s}\,\big(y(s) - \varphi(s)^T\theta\big)^2. \tag{9}$$

Here, the forgetting factor $\lambda \in [0,1]$ operates as an exponential
weight which decreases for the more remote data.
Problem (8) admits the recursive solution

$$R_t = \lambda R_{t-1} + \varphi(t)\varphi(t)^T, \tag{10a}$$

$$\hat\theta_t = \hat\theta_{t-1} + R_t^{-1}\varphi(t)\,\big(y(t) - \varphi(t)^T\hat\theta_{t-1}\big). \tag{10b}$$
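The link between the batch problem (8)-(9) and the recursion (10a)-(10b) can be checked numerically. The sketch below is restricted to two parameters so that it only needs the standard library; the helper names, the synthetic data and the initialization from a short batch solution are assumptions of this example, not part of the method.

```python
import random

def solve2(R, b):
    """Solve the 2x2 linear system R x = b by Cramer's rule."""
    det = R[0][0] * R[1][1] - R[0][1] * R[1][0]
    return [(R[1][1] * b[0] - R[0][1] * b[1]) / det,
            (R[0][0] * b[1] - R[1][0] * b[0]) / det]

def batch_estimate(phis, ys, lam):
    """Minimizer of the weighted loss (9), from the weighted normal equations."""
    t = len(ys)
    R = [[0.0, 0.0], [0.0, 0.0]]
    b = [0.0, 0.0]
    for s, (phi, y) in enumerate(zip(phis, ys), start=1):
        w = lam ** (t - s)          # exponential discount of older samples
        for i in range(2):
            b[i] += w * phi[i] * y
            for j in range(2):
                R[i][j] += w * phi[i] * phi[j]
    return solve2(R, b), R

def rls_step(theta, R, phi, y, lam):
    """One step of the recursion (10a)-(10b)."""
    R_new = [[lam * R[i][j] + phi[i] * phi[j] for j in range(2)] for i in range(2)]
    err = y - (phi[0] * theta[0] + phi[1] * theta[1])
    gain = solve2(R_new, phi)       # R_t^{-1} phi(t)
    return [theta[i] + gain[i] * err for i in range(2)], R_new

rng = random.Random(1)
phis = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(30)]
ys = [0.7 * p[0] - 0.2 * p[1] + rng.gauss(0, 0.1) for p in phis]
lam = 0.9

# Start the recursion from the batch solution on the first five samples.
theta, R = batch_estimate(phis[:5], ys[:5], lam)
for phi, y in zip(phis[5:], ys[5:]):
    theta, R = rls_step(theta, R, phi, y, lam)

theta_batch, _ = batch_estimate(phis, ys, lam)
```

Up to rounding, `theta` and `theta_batch` coincide, because $R_t\hat\theta_t = \lambda R_{t-1}\hat\theta_{t-1} + \varphi(t)y(t)$ holds at every step.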

Moreover, if we define $P_t = R_t^{-1}$ we obtain the equivalent
recursion

$$\hat\theta_t = \hat\theta_{t-1} + K_t\,\big(y(t) - \varphi(t)^T\hat\theta_{t-1}\big), \tag{11a}$$

$$K_t = \frac{P_{t-1}\varphi(t)}{\lambda + \varphi(t)^T P_{t-1}\varphi(t)}, \tag{11b}$$

$$P_t = \frac{1}{\lambda}\big(I - K_t\varphi(t)^T\big)P_{t-1}. \tag{11c}$$

In the case that the parameters in the ARX model (1) vary with
different rates, it is desirable to assign different forgetting
factors. The RLS with vector-type forgetting scheme, [18],
[15], consists of scaling $P_t$ by a diagonal matrix $\Lambda$ of
forgetting factors

$$P_t = \Lambda^{-1/2}\big(I - K_t\varphi(t)^T\big)P_{t-1}\Lambda^{-1/2} \tag{12}$$

where $\Lambda = \mathrm{diag}(\lambda_1 \ldots \lambda_p)$ with $p = n + m$. Therefore, $\lambda_i$ is
the forgetting factor reflecting the changing rate of the $i$-th
parameter. Finally, an ad-hoc modification of the update law
for the gain $K_t$ of the RLS has been proposed in [21]. In
this case the parameters to estimate are two. This method
conceptually separates the error due to the parameters into two
parts in the objective function (9), that is, one part contains
the error due to the parameter with faster changing rate and
the second one the error due to the parameter with slower
changing rate. Then a different forgetting factor is
applied to each term.


III. RLS WITH MULTIPLE FORGETTING SCHEMES


In this Section, we introduce our RLS for models whose
parameters have different changing rates. Our approach hinges
on the following observation.

Proposition 3.1: Problem (8) is equivalent to the following
problem:

$$\hat\theta_t = \operatorname*{argmin}_{\theta_t}\; \big(y(t) - \varphi(t)^T\theta_t\big)^2 + \lambda\,(\theta_t - \hat\theta_{t-1})^T R_{t-1}\,(\theta_t - \hat\theta_{t-1}), \tag{13}$$

with updating law (10a).

The proof is given in Appendix A.

Proposition 3.1 shows that the RLS with forgetting scheme
can be understood as a regularized least-squares problem. More
precisely, the first term in the objective function penalizes
the prediction error at time $t$, whereas the penalty term
penalizes the distance between $\theta_t$ and the previous estimate
$\hat\theta_{t-1}$ according to the weight matrix $\lambda R_{t-1}$. Moreover, the
weight matrix is updated according to the law (10a).

It is then natural to allow a more general structure for the
weight matrix $\lambda R_{t-1}$ and its updating law (10a). Let $F_\lambda(\cdot)$
be the forgetting map defined as follows

$$F_\lambda : \mathcal{S}_+^p \to \mathcal{S}_+^p, \qquad R_{t-1} \mapsto F_\lambda(R_{t-1}),$$

where $\mathcal{S}_+^p$ denotes the cone of positive definite matrices of
dimension $p$ and $\lambda = [\lambda_1 \ldots \lambda_p]^T \in \mathbb{R}^p$ is the forgetting
vector, with $0 < \lambda_i \le 1$, $i = 1 \ldots p$, the forgetting factor of the
$i$-th parameter.

Therefore, given $\hat\theta_{t-1}$, we propose the following
estimation scheme for $\theta_t$

$$\hat\theta_t = \operatorname*{argmin}_{\theta_t}\; \big(y(t) - \varphi(t)^T\theta_t\big)^2 + (\theta_t - \hat\theta_{t-1})^T F_\lambda(R_{t-1})\,(\theta_t - \hat\theta_{t-1}), \tag{14a}$$

$$R_t = F_\lambda(R_{t-1}) + \varphi(t)\varphi(t)^T. \tag{14b}$$

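Proposition 3.2: The solution to (14a) with updating law
(14b) admits the recursive solution

$$\hat\theta_t = \hat\theta_{t-1} + K_t\,\big(y(t) - \varphi(t)^T\hat\theta_{t-1}\big), \tag{15a}$$

$$K_t = R_t^{-1}\varphi(t), \tag{15b}$$

$$R_t = F_\lambda(R_{t-1}) + \varphi(t)\varphi(t)^T. \tag{15c}$$

Moreover, $K_t$ can be updated in the equivalent way:

$$K_t = \frac{F_\lambda(P_{t-1}^{-1})^{-1}\varphi(t)}{1 + \varphi(t)^T F_\lambda(P_{t-1}^{-1})^{-1}\varphi(t)}, \tag{16a}$$

$$P_t = \big(I - K_t\varphi(t)^T\big)F_\lambda(P_{t-1}^{-1})^{-1}, \tag{16b}$$

where $P_t = R_t^{-1}$.

The proof is given in Appendix B.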
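The equivalence between the two forms in Proposition 3.2 can be verified on a small example. The following Python sketch implements one step of (15a)-(15c) and one step of (16a)-(16b) for two parameters and a user-supplied forgetting map. The helper names and the particular map `forget_example` (an entrywise scaling chosen only for illustration) are assumptions of this sketch.

```python
def inv2(M):
    """Inverse of a 2x2 matrix."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1], M[1][0] * v[0] + M[1][1] * v[1]]

def step_R_form(theta, R, phi, y, forget):
    """(15a)-(15c): propagate R_t and use K_t = R_t^{-1} phi(t)."""
    F = forget(R)
    R_new = [[F[i][j] + phi[i] * phi[j] for j in range(2)] for i in range(2)]
    K = matvec(inv2(R_new), phi)
    err = y - (phi[0] * theta[0] + phi[1] * theta[1])
    return [theta[i] + K[i] * err for i in range(2)], R_new, K

def step_P_form(theta, P, phi, y, forget):
    """(16a)-(16b): propagate P_t = R_t^{-1} instead of R_t."""
    G = inv2(forget(inv2(P)))                 # F_lambda(P_{t-1}^{-1})^{-1}
    Gphi = matvec(G, phi)
    denom = 1.0 + phi[0] * Gphi[0] + phi[1] * Gphi[1]
    K = [g / denom for g in Gphi]
    # P_t = (I - K phi^T) G, written entrywise.
    P_new = [[G[i][j] - K[i] * sum(phi[k] * G[k][j] for k in range(2))
              for j in range(2)] for i in range(2)]
    err = y - (phi[0] * theta[0] + phi[1] * theta[1])
    return [theta[i] + K[i] * err for i in range(2)], P_new, K

def forget_example(R, lams=(0.98, 0.8)):
    """Illustrative forgetting map: entry (i, j) scaled by min(lam_i, lam_j)."""
    return [[min(lams[i], lams[j]) * R[i][j] for j in range(2)] for i in range(2)]

R0 = [[2.0, 0.5], [0.5, 1.0]]
theta0, phi, y = [0.0, 0.0], [1.0, -0.5], 0.3
theta_R, R1, K_R = step_R_form(theta0, R0, phi, y, forget_example)
theta_P, P1, K_P = step_P_form(theta0, inv2(R0), phi, y, forget_example)
```

Both calls return the same gain and the same estimate up to rounding, and `P1` agrees with the inverse of `R1`, which is the matrix inversion lemma applied to (15c).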

To design the forgetting map $F_\lambda$ we consider the following
result, whose proof can be found in [17].

Proposition 3.3: Consider $A, B \in \mathcal{S}_+^p$. Let $C$ be a
symmetric matrix of dimension $p$ such that

$$[C]_{ij} = [A]_{ij}[B]_{ij}, \qquad i,j = 1 \ldots p. \tag{17}$$

Then, $C \in \mathcal{S}_+^p$.

In view of the above result, a natural structure for $F_\lambda$
would be

$$[F_\lambda(R_{t-1})]_{ij} = [R_{t-1}]_{ij}\,[Q_\lambda]_{ij} \tag{18}$$

where $Q_\lambda \in \mathcal{S}_+^p$. Note that $Q_\lambda$ can be understood as
a kernel matrix with hyperparameters $\lambda$ in the context of
machine learning, [17], [22]. Next, we design three types
of maps drawing inspiration from the diagonal kernel, the
tuned/correlated kernel, [5], and the cubic spline kernel, [22].


A. Diagonal updating

Consider the ARX model (1) with $m = 1$ and $n = 1$,
so that we only have two parameters. Let $\theta_{t,1}$ and $\theta_{t,2}$
denote the parameters of $A_t(z^{-1})$ and $B_t(z^{-1})$, respectively.
Moreover, the vector containing the two parameters is
defined as $\theta_t = [\,\theta_{t,1} \;\; \theta_{t,2}\,]^T$. We assume that the changing
rate of $\theta_{t,1}$ is slow over the interval $[1, N]$, whereas the
changing rate of $\theta_{t,2}$ is faster. The simplest idea is to
decouple the parameters in the penalty term in (14a). We
associate the forgetting factor $\lambda_1$ to $\theta_{t,1}$ and $\lambda_2$ to $\theta_{t,2}$ with
$\lambda_1 > \lambda_2$. Let

$$R_{t-1} = \begin{bmatrix} R_{t-1,1} & R_{t-1,12} \\ R_{t-1,12} & R_{t-1,2} \end{bmatrix}. \tag{19}$$

Then, if we define

$$F_{\lambda,DI}(R_{t-1}) = \begin{bmatrix} \lambda_1 R_{t-1,1} & 0 \\ 0 & \lambda_2 R_{t-1,2} \end{bmatrix} \tag{20}$$

the penalty term in (14a) becomes

$$\lambda_1(\theta_{t,1} - \hat\theta_{t-1,1})^2 R_{t-1,1} + \lambda_2(\theta_{t,2} - \hat\theta_{t-1,2})^2 R_{t-1,2} \tag{21}$$

that is, the parameters of $A_t(z^{-1})$ and the ones of $B_t(z^{-1})$
have been decoupled in the penalty term.

This simple example leads us to consider the diagonal
updating

$$[F_{\lambda,DI}(R_{t-1})]_{ij} = \begin{cases} 0 & \text{if } \lambda_i \neq \lambda_j \\ [R_{t-1}]_{ij}\,\lambda_i & \text{otherwise.} \end{cases}$$

Finally, it is worth noting that in the special case that $p = 2$
we obtain the method proposed in [21, formulae (22) and
(23)].


B. Tuned/Correlated updating

We consider again the example of Section III-A. The
changing rate of $R_{t-1,12}$ depends on the changing rates of
$\theta_{t,1}$ and $\theta_{t,2}$. Hence, it is reasonable to forget past values of
$R_{t-1,12}$ with the fastest changing rate between the one of $\theta_{t,1}$
and the one of $\theta_{t,2}$. Therefore, we weigh $R_{t-1,12}$ with the forgetting
factor $\lambda_2$:

$$F_{\lambda,TC}(R_{t-1}) = \begin{bmatrix} \lambda_1 R_{t-1,1} & \lambda_2 R_{t-1,12} \\ \lambda_2 R_{t-1,12} & \lambda_2 R_{t-1,2} \end{bmatrix}.$$

Moreover, the corresponding penalty term is

$$\lambda_1(\theta_{t,1} - \hat\theta_{t-1,1})^2 R_{t-1,1} + \lambda_2(\theta_{t,2} - \hat\theta_{t-1,2})^2 R_{t-1,2} + 2\lambda_2(\theta_{t,1} - \hat\theta_{t-1,1})(\theta_{t,2} - \hat\theta_{t-1,2})R_{t-1,12}. \tag{22}$$

Thus, the weight of the cross term is dominated by the
smallest forgetting factor. Therefore, in the general case, a
reasonable updating law is:

$$[F_{\lambda,TC}(R_{t-1})]_{ij} = \min(\lambda_i, \lambda_j)\,[R_{t-1}]_{ij}. \tag{23}$$
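The diagonal and tuned/correlated updatings can both be written in the entrywise form (18). The Python sketch below builds the two kernel matrices and evaluates the penalty terms (21) and (22) on a two-parameter example. The function names and numbers are hypothetical, and the zero pattern of the diagonal kernel assumes that an entry is dropped whenever the two associated forgetting factors differ.

```python
def q_diagonal(lams):
    """Diagonal updating kernel: lam_i where lam_i == lam_j, 0 elsewhere (assumed pattern)."""
    p = len(lams)
    return [[lams[i] if lams[i] == lams[j] else 0.0 for j in range(p)]
            for i in range(p)]

def q_tc(lams):
    """Tuned/correlated kernel of (23): entry (i, j) is min(lam_i, lam_j)."""
    p = len(lams)
    return [[min(lams[i], lams[j]) for j in range(p)] for i in range(p)]

def forget(R, Q):
    """Entrywise product (18): [F(R)]_ij = [R]_ij [Q]_ij."""
    p = len(R)
    return [[R[i][j] * Q[i][j] for j in range(p)] for i in range(p)]

def penalty(F, d):
    """Quadratic penalty d^T F d, with d = theta_t - theta_{t-1}."""
    p = len(d)
    return sum(d[i] * F[i][j] * d[j] for i in range(p) for j in range(p))

lams = [0.95, 0.7]                  # slow parameter first, fast parameter second
R = [[2.0, 0.5], [0.5, 1.0]]
d = [0.1, -0.2]

pen_di = penalty(forget(R, q_diagonal(lams)), d)   # matches (21)
pen_tc = penalty(forget(R, q_tc(lams)), d)         # matches (22)
```

For these numbers the two penalties differ only by the cross term $2\lambda_2 d_1 d_2 R_{t-1,12}$, which the diagonal updating discards.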


C. Cubic Spline updating 


Consider the example of Section III-A. We want to
construct an updating such that the weight of the cross term
in the penalty term (14a) is not totally dominated by the
forgetting factor $\lambda_2$. More precisely, we want the weight
of the cross term to be influenced also by $\lambda_1$. We consider $Q_\lambda$
as a cubic spline like kernel matrix

$$[Q_\lambda]_{ij} = \frac{\min(l_i, l_j)^2}{2}\left(\max(l_i, l_j) - \frac{\min(l_i, l_j)}{3}\right), \qquad i,j = 1,2, \tag{24}$$

where $l_i > 0$, $i = 1,2$, is a function of $\lambda_i$ to be determined.
In our case we want that

$$[Q_\lambda]_{ii} = \frac{l_i^3}{3} \tag{25}$$

is equal to $\lambda_i$ for $i = 1,2$. Therefore, we obtain

$$l_i = \sqrt[3]{3\lambda_i}, \qquad i = 1,2. \tag{26}$$


In this way, we build a forgetting map whose cross term is
penalized by a blend of $\lambda_1$ and $\lambda_2$.

Remark 3.1: One could also consider the matrix $\tilde Q_\lambda$ such
that $[\tilde Q_\lambda]_{ij} = \sqrt{\lambda_i\lambda_j}$, $i,j = 1 \ldots p$. To compare (24) and
$\tilde Q_\lambda$, assume that $\lambda_1$ is fixed equal to 0.3, whereas $\lambda_2$ can vary
over the interval $[0,1]$. In Figure 1 we depict the functions
$f(\lambda_2) = \frac{\min(l_1,l_2)^2}{2}\left(\max(l_1,l_2) - \frac{\min(l_1,l_2)}{3}\right)$,
with $l_i = \sqrt[3]{3\lambda_i}$, and $g(\lambda_2) = \sqrt{\lambda_1\lambda_2}$. As one can see,
$f(\cdot)$ takes smaller values than $g(\cdot)$, that is, the
influence of the smallest forgetting factor is more marked in
$f(\cdot)$.

Thus, as the plots show, (24) provides a blend of $\lambda_1$ and
$\lambda_2$ in which the influence of $\lambda_2$ (the forgetting factor associated
to the parameter with the fastest changing rate) is more
marked than in $\tilde Q_\lambda$.

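In the general case, the updating law therefore becomes

$$[F_{\lambda,CS}(R_{t-1})]_{ij} = [R_{t-1}]_{ij} \times \frac{\min(l_i, l_j)^2}{2}\left(\max(l_i, l_j) - \frac{\min(l_i, l_j)}{3}\right), \tag{27}$$

where $l_i = \sqrt[3]{3\lambda_i}$, $i = 1 \ldots p$.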
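A minimal Python sketch of the cubic spline updating is given below. It assumes the standard cubic spline kernel form $\frac{\min^2}{2}\left(\max - \frac{\min}{3}\right)$ for the entries of $Q_\lambda$ together with $l_i = \sqrt[3]{3\lambda_i}$. The function names are hypothetical and the numerical values are chosen only for illustration.

```python
import math

def q_cs(lams):
    """Cubic spline like kernel with l_i = (3 lam_i)^(1/3), so that the diagonal is lam_i."""
    l = [(3.0 * lam) ** (1.0 / 3.0) for lam in lams]
    p = len(l)
    Q = [[0.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            lo, hi = min(l[i], l[j]), max(l[i], l[j])
            Q[i][j] = lo ** 2 / 2.0 * (hi - lo / 3.0)
    return Q

def forget_cs(R, lams):
    """Entrywise product of R with the cubic spline like kernel."""
    Q = q_cs(lams)
    p = len(R)
    return [[R[i][j] * Q[i][j] for j in range(p)] for i in range(p)]

lams = [0.3, 0.1]
Q = q_cs(lams)
cross_cs = Q[0][1]                          # weight of the cross term
cross_geo = math.sqrt(lams[0] * lams[1])    # alternative weight sqrt(lam_1 lam_2)
cross_min = min(lams)                       # weight used by the tuned/correlated updating
```

For $\lambda_1 = 0.3$ and $\lambda_2 = 0.1$ the cross weight lies between $\min(\lambda_1, \lambda_2)$ and $\sqrt{\lambda_1\lambda_2}$, which is one way to read the "blend" of the two forgetting factors.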



















Fig. 1. Comparison between $f(\cdot)$ and $g(\cdot)$ (horizontal axis: $\lambda_2$).


IV. Simulation Results

In this section we analyse the performance of the RLS with
multiple forgetting schemes presented in Section III.
The experiment has been performed using MATLAB as the
numerical platform.


A. Data generation 

We consider a discrete-time, time-varying ARX model
described in (1), with $n = 2$ and $m = 2$. Here, the parameters
in $A_t(z^{-1})$ vary faster than the ones in $B_t(z^{-1})$. To this
aim, nine stable polynomials $A^{(j)}(z^{-1})$, $j = 1 \ldots 9$, and two
stable polynomials $B^{(k)}(z^{-1})$, $k = 1,2$, have been defined.
We considered the time interval $[1, N]$ with $N = 160$. The
polynomial $B_t(z^{-1})$ is generated as a smooth time-varying
convex combination of $B^{(1)}(z^{-1})$ and $B^{(2)}(z^{-1})$. Regarding
$A_t(z^{-1})$, we split the interval $[1, N]$ into eight sub-intervals and
in the $j$-th interval $A_t(z^{-1})$ is generated as a smooth
time-varying convex combination of $A^{(j)}(z^{-1})$ and $A^{(j+1)}(z^{-1})$.

Finally, the input $u(t)$ is generated as a realization of
white Gaussian noise with unit variance, filtered with a
10th order Butterworth low-pass filter. Starting from a random
initial condition, the output $y(t)$ is collected and corrupted by
an additive white Gaussian noise with variance $\sigma^2 = 0.01$.
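The piecewise convex combination used to generate the time-varying coefficients can be sketched as follows. The text does not specify the smooth weight, so the raised-cosine ramp `smooth_weight`, the function name `blend` and the placeholder coefficient lists are assumptions of this example.

```python
import math

def smooth_weight(x):
    """Hypothetical smooth ramp from 0 to 1 on [0, 1] (raised cosine)."""
    return 0.5 * (1.0 - math.cos(math.pi * x))

def blend(t, N, polys):
    """Coefficients at time t (1-based) as a convex combination of consecutive polynomials.

    The interval [1, N] is split into len(polys) - 1 equal sub-intervals; inside the
    j-th one the weight moves smoothly from polys[j] towards polys[j + 1].
    """
    segments = len(polys) - 1
    length = N / segments
    j = min(int((t - 1) / length), segments - 1)   # 0-based sub-interval index
    x = ((t - 1) - j * length) / length            # position inside the sub-interval
    w = smooth_weight(x)
    return [(1.0 - w) * a + w * b for a, b in zip(polys[j], polys[j + 1])]

# Placeholder coefficient lists (not the polynomials used in the experiments).
a_polys = [[0.1 * k, -0.05 * k] for k in range(9)]   # nine polynomials, eight sub-intervals
b_polys = [[1.0, 0.5], [0.8, 0.3]]                   # two polynomials, one sub-interval
N = 160
a_path = [blend(t, N, a_polys) for t in range(1, N + 1)]
b_path = [blend(t, N, b_polys) for t in range(1, N + 1)]
```

With nine polynomials and $N = 160$ each sub-interval has length 20, so the coefficients of $A_t(z^{-1})$ complete eight transitions while those of $B_t(z^{-1})$ complete a single one.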


B. Proposed Methods 

The methods we consider are:

• RARX: this is the classic RARX algorithm implemented
in rarx.m in the MATLAB System Identification
Toolbox, [10];

• VF: this is the RLS with vector-type forgetting scheme
described at the end of Section II;

• DI: this is the RLS algorithm with diagonal updating of
Section III-A;

• TC: this is the RLS algorithm with tuned/correlated
updating of Section III-B;

• CS: this is the RLS algorithm with cubic spline updating
of Section III-C.


For each method $m = 2$ and $n = 2$, that is, the estimated
ARX models have the same order as the true one. Regarding
VF, DI, TC and CS we set

$$\lambda = [\,\lambda_1 \;\; \lambda_1 \;\; \lambda_2 \;\; \lambda_2\,] \tag{28}$$

that is, $\lambda_1$ is the forgetting factor for the parameters in
$A_t(z^{-1})$ and $\lambda_2$ is the forgetting factor for the parameters
in $B_t(z^{-1})$.


C. Experiment setup 

We consider a study of 500 runs. For each run, we generate
the data as described in Section IV-A and we compute $\hat\theta_t$ with
the five methods. More precisely, for each method (VF, DI,
TC and CS) we compute $\hat\theta_t$ for twenty values of $\lambda_1$ and $\lambda_2$
uniformly sampled over the interval $[0.1, 1]$. Then, we pick
$\lambda_1^\circ$ and $\lambda_2^\circ$ which maximize the one step ahead coefficient of
determination (in percentage)

$$\mathrm{COD} = \left(1 - \frac{\frac{1}{N}\sum_{t=1}^{N}\big(y(t) - \hat y(t)\big)^2}{\frac{1}{N}\sum_{t=1}^{N}\big(y(t) - \bar y_N\big)^2}\right) \times 100 \tag{29}$$


where $\hat y(t)$ is the predicted value of $y(t)$ based on the
ARX model with $\hat A_{t-1}(z^{-1})$ and $\hat B_{t-1}(z^{-1})$, and $\bar y_N$ is the
sample mean of the output data. It is worth noting that the
performance index COD is normally used for time-invariant models.
Nevertheless, it provides a rough idea of whether the
estimated model is good or not and it allows us to choose
reasonable values for $\lambda_1^\circ$ and $\lambda_2^\circ$. Then, for $\lambda_1^\circ$ and $\lambda_2^\circ$ we
compute the corresponding average track fit (in percentage)


$$\mathrm{ATF} = \left(1 - \frac{1}{N}\sum_{t=1}^{N}\frac{\|\hat\theta_t - \theta_t\|}{\|\theta_t\|}\right) \times 100. \tag{30}$$

Regarding RARX, we use the procedure above with one 
forgetting factor. 
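The selection procedure above can be sketched in a few lines of Python. The index `cod` follows (29); the evenly spaced grid is one possible reading of "uniformly sampled", and the names `grid` and `pick_best` as well as the toy score used in the example are assumptions of this sketch.

```python
def cod(y, y_hat):
    """One step ahead coefficient of determination (29), in percentage."""
    mean = sum(y) / len(y)
    num = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    den = sum((a - mean) ** 2 for a in y)
    return (1.0 - num / den) * 100.0

def grid(lo=0.1, hi=1.0, count=20):
    """Twenty evenly spaced candidate forgetting factors over [0.1, 1] (assumed spacing)."""
    step = (hi - lo) / (count - 1)
    return [lo + k * step for k in range(count)]

def pick_best(score, values):
    """Return the pair (lam1, lam2) maximizing score(lam1, lam2) over the grid."""
    return max(((a, b) for a in values for b in values),
               key=lambda pair: score(pair[0], pair[1]))

# Toy score standing in for "run the estimator with (lam1, lam2) and compute COD".
values = grid()
best = pick_best(lambda a, b: -(a - 0.1) ** 2 - (b - 1.0) ** 2, values)
```

In an actual experiment the score passed to `pick_best` would run one of the estimators on the generated data and return `cod(y, y_hat)` for the resulting one step ahead predictions.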


D. Results 

Figure 2 shows the selected values of the forgetting factors. The first boxplot
refers to the values chosen by the classic RARX algorithm;
the second to the fifth represent the values of the forgetting factor
$\lambda_1$ referring to the parameters of $A_t(z^{-1})$,
while the last ones refer to the forgetting factor $\lambda_2$ related to
the parameters of $B_t(z^{-1})$. Since the parameters of $A_t(z^{-1})$
vary faster than the ones of $B_t(z^{-1})$, its forgetting factors
are smaller than the respective others, as expected. On the
other hand, the classic RARX cannot
choose different forgetting factors, so its best choice is to
take an intermediate value among the ones picked by the
proposed algorithms.

Figure 3 depicts the average track fit indexes.
All the proposed algorithms perform better than
RARX and VF, and the
TC updating shows the best results. This fact suggests that
the most efficient weight for the cross terms in the penalty
term in (14a) is the smallest forgetting factor between the
eligible ones, as occurs in the TC algorithm.

Figure 4 illustrates the COD indexes. Once again the
proposed algorithms outperform the classic RARX method: if
we focus on the average value of the boxplots, the difference
is around 5%. In terms of outliers, we note that
RARX reaches $-100\%$ in the worst case scenario, while the
proposed methods never go below $-55\%$.

Fig. 2. Forgetting factors of the different algorithms. First column: forgetting factor of RARX. Second-fifth column: forgetting factor $\lambda_1$ for VF, DI, TC
and CS. Sixth-last column: forgetting factor $\lambda_2$ for VF, DI, TC and CS.

Fig. 3. Average track fit of the different algorithms.

Fig. 4. One step ahead coefficient of determination of the different
algorithms.


V. Conclusions 

We presented a reformulation of the classic RLS
algorithm, which can be split into the minimization of the current
prediction error and the minimization of a quadratic function
which penalizes the distance between the current and the
previous value of the estimate. This reformulation is strictly
connected to an updating equation which provides the weight
matrix of the quadratic function: changing the updating
equation given by the classic algorithm means replacing
the map that connects the present weight matrix to the past
one. This makes it possible to introduce multiple forgetting factors to
improve the estimation of parameters with different changing
rates.

In this paper we provide three different updating laws.
Simulations show that these algorithms outperform the
conventional ones thanks to the proposed updating law, which
allows the presence of several forgetting factors. Therefore,
multiple forgetting factors seem to be the key to a more
efficient identification. It is worth noting that the challenging
step is the choice of such forgetting factors. Therefore, the
next research direction will concern the estimation of such
parameters from the collected data.


Appendix 


A. Proof of Proposition 3.1 


Let $Q_t = \mathrm{diag}(1 \ldots \lambda^{t-1})$. Consider